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Abstract 

A multidisciplinary, collaborative simulation has been performed on a Grid of geographically distributed PC 
clusters. The multiscale simulation approach seamlessly combines i) atomistic simulation based on the molecular 
dynamics (MD) method and ii) quantum mechanical (QM) calculation based on the density functional theory (DFT) so 
that accurate but less scalable computations are performed only where thev are needed. The multiscale MD/QM 
simulation code has been Grid-enabled using i) a modular, additive hybridization scheme, ii) multiple QM clustering 
and in) computation/communication overlapping. The Gndified MD/QM simulation code has been used to study 
environmental effects of water molecules on fracture in silicon. A preliminary run of the code has achieved a parallel 
efficiency of 94% on 25 PCs distributed over 3 PC clusters in the US and Japan, and a larger test involving 154 
processors on 5 distributed PC clusters is in progress. 43 
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1 Introduction 


P of geographically distributed computing platfomis connected via high-speed networks 

exnirtKe ll r t e H COmP f reSearch ’ by enab ' in§ collaborat * v «. hybrid computations that integrate multiple 
expertise distributed over wide geographical locations [1], The availability of inexpensive PC clusters at the 

research-group level suggests a new collaborative mode for computational research, in which multiple research 

Droeram°s f a d d erSe ex P e ^ ,Se P artlc, P ate m 3 metacomputmg project by providing both expert-maintained^pplication 
programs and computational resources to run them. Such a multidisciplinary application ,s emerging at the forefront 
ompu ational sc ^ nce s and engmeenng. The multiscale simulation embeds accurate quantum mechanical (QM) 

2“ "““ “ * '«"8<h *al t of 10'® m> , lecuh, S 

simulation to describe large-scale atomistic processes (up to a length scale of 10 6 m), see Fig. 1 [2-51. Modem 

esign of high-performance matenals and devices focuses on controlling structures at diverse length scales from 

“ “«T “>■ “« ®“ h ”«'> “* MD/QM simubdonn enpncled p ,„ „ ,m p „, n “fc ” 
scaling down engmeenng concepts to nanometer scales. F 



Figure 1: Multiscale MD/QM simulation of the reaction of water molecules at a crack tip in silicon (top) performed 
on geogra phically distributed PC dustem in the US and Japan In than e„m P l«, QM calculation™ re 

! P h « re ® QM silicon atoms; blue, handshake silicon atoms' 

red, QM oxygen atoms; yellow, QM hydrogen atoms; gray, MD silicon atoms. 


Unfortunately only a limited dass of scientific and engmeenng applications has been shown to scale on such 
Grid computing platforms [7], This paper desenbes our efforts, to enable MD/QM simulations on a Grid by 
designing scalable multiscale simulation algonthms. In the next section, we describe MD and QM algorithms as 
well as their Gnd-enabled hybridization schemes. Section 3 discusses Gnd implementation of the resulting MD/QM 
simulation algonthm. Results of benchmark tests are given in Sec. 4, and Sec. 5 contains conclusions ' 


2 Multiscale simulation algorithms 


^* ha r e ^eloped a multiscale simulation algorithm that combines a classical MD simulation and a 
vM calculation based on the density functional theory (DFT) 
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2.1 Space-time multiresolution molecular dynamics algorithm 

The MD approach obtains the phase-space trajectories (positions and velocities of all atoms at all time) by 
numerically integrating coupled ordinary differential equations. The dynamics is encoded in the interatomic 
potential energy, which is a function of the positions of all Adatoms, r' v = {r 1} r 2 , ..., r tV ), in the system. In 

our many-body interatomic potential scheme, £' MD (r' v ) is an analytic function that depends on relative positions of 
atomic pairs and triples [3]. 

We have developed highly efficient, multiresolution algorithms to carry out large-scale MD simulations on 
parallel computers [8,9]. The most compute-intensive problem in an MD simulation is the 0{N 2 ) computation of the 
electrostatic energy for N charged atoms. We use the Fast Multipole Method (FMM) [10] to reduce the complexity 
to 0(N) by computing the electrostatic field recursively on an octree. In our multiresolution molecular dynamics 
(MRMD) algorithm, the FMM is combined with a symplectic, multiple time-scale (MTS) method [1 1] that applies 
different force-updating schedules for different force components while keeping the phase-space volume occupied 
by the atoms as a simulation loop invariant. 

Our MRMD program also features wavelet-based computational-space decomposition for adaptive load 
balancing [12, 13] and spacefilling-curve-based adaptive data compression for scalable I/O [14]. 

2.2 Real-space quantum-mechanical calculation based on the density functional theory 

Empirical interatomic potentials used in MD simulations fail to describe chemical processes. Instead, interatomic 
interaction in reactive regions needs to be calculated by a QM method that can describe breaking and formation of 
bonds. An atom consists of a nucleus and surrounding electrons, and quantum mechanics explicitly treats the 
electronic degrees-of- freedom. Since each electron’s wave function is a linear combination of multiple states, the 
combinatorial solution space for a many-electron problem is exponentially large. Walter Kohn received a Nobel 
prize in 1998 for the development of a heuristic called the density functional theory (DFT) [15-17]. The DFT avoids 
the exhaustive enumeration of many-electron correlations by solving W wf single-electron problems in a common 
aveiage environment, which is determined self-consistently (N W f is the number of independent wave functions or 
electronic bands). As a result, the problem is reduced to a self-consistent matrix eigenvalue problem, which can be 
solved with 0(N %V i ) operations. The DFT problem can be formulated as the minimization of an energy functional, 
*QM(r w ), with respect to electron wave functions, = {y , (r), y 2 (r ),..., y. v (r)}, subject to 

orthonormalization constraints on the wave functions. 

’ F° r efficient parallel implementation of the DFT, we employ real-space approaches based on higher-order finite 
differencing [18] and multigrid acceleration [19, 20]. Our parallel DFT code [9, 21, 22] includes electron-ion 
interactions using norm-conserving pseudopotentials [23] and the exchange-correlation energy in a generalized 
gradient approximation [24]. 

2.3 Multiscale molecular-dynamics/quantuni-mechanical simulation algorithm 

We have* developed a multiscale MD/QM simulation scheme, in which atomic clusters described by QM calculation 
are embedded in an atomistic region (see Fig. 1) [3-5]. The motion of atoms is described with a real-space 
multignd-based DFT in the QM clusters and with the MD approach in the surrounding region. To make the 
MD/QM simulation scalable on a Grid, we have used the following techniques. 

Additive hybridization. To minimize the modification of existing MD and QM codes, we employ an additive 
hybridization approach based on a linear combination of MD and QM energies [25]. The total energy, £ y is written 
as 


E= 4^ + ^QM^CC r QM ). {>HS })- 4‘fD ter ( ( «Q M ). (r H S }) , 


( 1 ) 


where is the classical MD energy for the entire system and the last two terms represent the QM energy 


collection for the QM cluster: is the QM energy for the cluster (its dangling bonus are terminated by 


hydrogen atoms to provide appropriate boundary conditions), and ^^ tcr is the MD potential energy of the cluster 
(terminated by appropriate MD atoms). In Eq. (1), {r QM } is the set of the positions of the QM atoms, and {r HS } are 
the positions of handshake (HS) atoms, i.e. t classical atoms bonded to the QM atoms (see Fig. 1). In QM 
calculation, each bond connecting a QM atom and a HS atom is terminated by a hydrogen atom. Positions of the 
termination hydrogen atoms are determined dynamically with the scaled-position link atom method as a function of 
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{ r QM } iind {ths/. Other physical quantities, such as interatomic forces are rferTveW fv-nm c /i \ 

sun, of the QM energy corrections for the clusters in the additive ° ^ ^ ** “ 3 

*-*3ST+ I [^QM ter ( ( r Q M } , { r HS i ) ^ 4 'id ter ( f r QM } , f r HS })] . (?) 

duster v ~ y 

computationally more efficient than the single-QM-cluster scheme because of the OWWaline f The T-T “ 

?" f r m DFT *' EOn,hmS — «• «*> algorithms ft*, 



Figure 2: A flowchart of parallel computations in the hybnd MD/QM simulation algorithm. 

exchange of^duste^anam^'oi^ as^hown i^A^flowchart hi MT)^ * * 

compute the energy and forces of the entire system and send 

the QM processor groups. Subsequently, the MD and QM processors independently perform he MD and OM 
computatmns on the atom.c clusters. The QM energy and forces are then returned to the MD processor whe^ 
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total energy and corresponding forces are calculated and the equations of motion are integrated to update the atomic 
positions and velocities. The communications between the MD and QM processors are minimal, since the MD 
processors only need to send several hundred atomic coordinates to each QM cluster, which in return sends back the 
calculated several hundred force components. (This is in contrast to non-additive hybridization schemes, in which 
the QM tasks continuously access the atomic information in the outer region during their computations.) 

Computation/communication overlap. To hide the latency, the communications between the MD and QM 
processors have been overlapped with the computations. Our spatial decomposition scheme splits the computation 
on each processor into the interior and boundary computations. The interior computation is then fully overlapped 
with the communication of the boundary data. 


3 Grid implementation 

The above hybridization scheme is amenable to meta-computing on a Grid. We have implemented the multiscale 
MD/QM simulation algorithm as a single MPI (Message Passing Interface) program. The Globus middleware 
(www.globus.org) and the Grid-enabled MPI implementation, MPICH-G2 (www3.niu.edu/mpi), have been used to 
implement the MPI-based multiscale MD/QM simulation code in a Grid environment. 

In the parallel MD/QM program, all the tasks constitute a single MPI communicator, MPI_COMM_WORLD, 
and processors are grouped into MD and QM groups by defining multiple MPI communicators. (The MD 
calculation, as well as each of the QM-cluster calculations, is assigned a dedicated communicator.) The code is 
written in a single-program multiple-data (SPMD) style, which uses selection statements for the MD and QM 
processors to execute only the corresponding code segments. Dynamic memory allocation/deallocation operations 
in Fortran 90 are used to reduce the memory size. 

In the current implementation, processors on multiple PC clusters are statically allocated using a host file. The 
user specifies the number of processors for each QM-cluster calculation in a configuration file. (On heterogeneous 
platforms, load balancing needs to be manually achieved by assigning more processors on slower platforms.) The 
additive hybridization and multiple QM clustering schemes minimize the inter-task communications, and thus the 
maximal parallel efficiency is achieved by assigning each QM cluster to a single PC cluster. Each PC cluster, on the 
other hand, can have multiple QM clusters without much communication overhead. 

The QM calculation involves significant I/O for saving wave functions of all the electrons. For example, the 
QM calculation of 76 atoms (36 Si and 40 H atoms) produces .24 MBytes of data. (The wave functions in the 
previous 2 MD steps are saved to obtain a good initial guess for the iterative solution by extrapolation.) To avoid 
the I/O bottleneck, these wave functions are saved locally within each PC cluster. 


4 Performance tests 

We have implemented the hybrid MD/QM simulation code on geographically distributed PC clusters in the US and 
Japan. The Linux-based PC clusters used for this project include: 

• A 65-processor Intel Pentium IV 2.0GHz cluster at Yamaguchi University in Japan; 

• Three 24-processor Intel Pentium IV l.8GHz clusters at Hiroshima, Okayama, and Niigata Llniversities in 
Japan; 

• A 17-processor AMD Athlon XP 1900+ cluster at the Louisiana State University (LSU) in the US. 

These PCs are connected using Gigabit Ethernet switches inside each cluster and uplink to campus backbone. Non- 
dedicated networks used for the Grid include the SINET/SuperSINET (OC192), APAN/TransPAC (OC12), Abilene 
(OC48), LaNet (OC3), and campus networks of the participating institutions. However, the measured bandwidth 
between LSU and the other institutions in Japan is low (ranging from 50 to 200 kBytes/s) due to the bottleneck of 
the low-bandwidth campus networks and bandwidth sharing. 

A preliminary benchmark test of the hybrid MD/QM simulation code has been performed on the Grid of PC 
clusters described above. (A larger benchmark test involving 154 processors is being performed on 5 PC clusters in 
the US and Japan.) MD/QM simulations have been performed to study environmental effects of H 2 0 molecules on 
fracture of strained Si, in which atoms near the crack tip are treated quantum-mechanically in the framework of the 
DFT. Significant effects of the stress intensity factor on the reaction of the H 2 0 molecules at the crack tip have been 
thereby observed [5]. The simulated system is a cracked Si with (110) crack surfaces under uniaxial tension, 
containing 91,256 atoms. The total number of QM atoms is 16N c \ mXcr (36 Si and 40 termination-H atoms per cluster) 
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where Adjuster is the total number of QM clusters, 
the crack tip region. 


At least several hundred QM atoms are needed to accurately model 


Figure 3 shows the execution time of the MD/QM simulation code as a function of N clmKT where each OM- 
H ,S perfo ™ ed on an ^-processor PC cluster. The total number of processors is thus P = 1 + 8AQ 1 

MD S “ T In?? 6 ' ° n , 1 PrOCeSSOr) ' In the m ° n 25 P^-cssors, J PC cluster at LSU perfoSe 
, v , e as 1 Q M ' cluster calculation on 9 (- 1 + 8) processors, whereas each of the PC clusters at Nimata 
and Yamaguchi uses 8 processors to perform 1 QM-cluster calculation. & 

Since the system size in the QM calculations scales linearly with the number of processors the constant 

task t ne arlv C Per f S ? at ‘°\ SteP Slgn ‘ f,eS perfeCt S P eedu P- ( As sh °™ in Fig. 3, the execution time for the MD 
task is nearly constant, since the number of MD atoms is kept constant.) We first define the speed of the MD/OM 

program as a product of the total number of QM atoms and time steps executed ,n a second Se sealed 

=IVe " by he ratl ® between the s P eed of Cluster Clusters and that of one cluster. The scaled efficiency is the scaled 
speedup divided b y /V cl On 25 processors (3 PC clusters) in the US and Japan, the 

■ only a si,sht mcrease from 21,2 «■ * pc — » . ^ 


Number of processors 



clusters 3 / 31 ’ 'm tlmC r r “° n fOT the MD/ Q M ^ulmon code as a function of the number of QM 
clusters, AQ (The number of simulated QM atoms is 76/V cluste .) Each QM-cluster calculation is performed on an 

Md Ja^anTs thus iTg’v Th^ is perfonned ™ ' processor. The number of processors in the US 

dencSTy CX ** P *“ ^ *» »< ™ - 

As is evident in Fig. 3, the dominant computation in the MD/QM simulation is the QM task Table I show* the 
averaged wall c ock and communication times per MD step for three different stages of the QM task , InirialTmn 
n) Herat, ve solution of the eigenvalue problem; and iii) force calculation 8 Not 

computationally dominant, but it also involves considerable communication. g 


TABLE L Partial wall clock and communication times per MD step for the OM task. 



Initialization 

Eigenvalue problem 

Force calculation 

Wall clock time (sec) 

16.5 

192.7 

1.5 

Communication (sec) 

r~ i.o 

67.9 

1.3 
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The computation of the QM task scales as OiN 3 IP) for N QM atoms on P processors, whereas the 
communication scales as 0(N(N!P)~ n + N'logP) [22], The communication overhead becomes less for large grain 
sizes, NIP, but could become a bottleneck if a QM cluster is assigned on multiple PC clusters with weak 
communication links. 


5 Conclusions 

We have demonstrated the scalability of multiscale MD/QM simulations on a Grid of geographically distributed PC 
clusters in the US and Japan. The resulting high efficiency (despite the weak communication link) suggests that this 
is an efficient collaboration mode in computational research. Such a multiscale simulation collaborator will consist 
of geographically distributed application specialists who not only maintain their computational modules with most 
updated simulation algorithms but also provide compute servers that are best suited to their own application codes. 
We have shown that the additive multiple-clustering scheme is effective in Gridifying such multiscale simulation 
codes. Furthermore, the additive hybridization scheme allows multiple layers of nested hybridization including, 
e.g., MD, DFT, and higher accuracy QM calculations such as configuration interaction. These methods are 
currently used by separate groups of scientists to solve similar problems at different levels of accuracy and problem 
sizes. The multiscale simulation collaboratory on a Grid will allow these scientists to jointly solve challenging 
scientific/engineering problems. The Gridification approach in this paper should be applicable to other modular 
multiscale simulations as well. An example is the black-box hybridization scheme based on the multilevel Newton 
method, in which the residual equation is formulated in a coupled model space combining, e.g.. electrostatic and 
mechanical problems [26]. 
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